1 Workshop Description
This workshop demonstrates how users can use GDC ATAC-seq data in their analysis.
For more information about the data please visit GDC publication website (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG) and read the paper:
- CORCES, M. Ryan, et al. The chromatin accessibility landscape of primary human cancers. Science, 2018, vol. 362, no 6413, p. eaav1898. https://doi.org/10.1126/science.aav1898
1.1 Pre-requisites
- Basic knowledge of R syntax
1.2 Workshop Participation
Students will have a chance to download ATAC-seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results as visualized as a volcano plot and a heatmap.
1.3 Goals and objectives
- Download and understand the ATAC-seq data
- Compare two different groups of samples ATAC-seq data
2 Data from the workshop
You can find the t.test result and the ESCA ATAC-seq Summarized Experiment object at google drive. They are used to plot the heatmap and the volcano plot. But if you follow the initial sections you should be able to create them.
3 Loading required R libraries
The libraries below are used in the workshop.
# to read txt files
library(readr)
# to transform data into GenomicRanges
library(GenomicRanges)
# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
# For the t.test loop
library(plyr)
# For easy volcano plot
library(TCGAbiolinks)
# For heatmap plot
library(ComplexHeatmap)
library(circlize)If one of them are not installed you can install them with BiocManager as shown below.
4 Data
The ATAC-seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG
4.1 Understanding the peaks sets
There are mainly two types of ATAC-seq Counts Matrices raw and normalized which covers mainly two sets of peaks:
- “cancer type-specific peak set” containing all of the reproducible peaks observed in an individual cancer type. These peaks were observed in at least two samples with a score per million value \(>=5\)
- “pan-cancer peak set” representing reproducible peaks from all cancer types that could then be used for cross-cancer comparisons
If we check the both sets (Files downloaded from GDC: “All cancer type-specific peak sets. [ZIP]” and “Pan-cancer peak set. [TXT]”), the set of peaks “pan-cancer peak set” consists of \(~562K\) peaks, and it contains a subset of each “cancer type-specific peak set”. We show an example for Esophageal carcinoma (ESCA) below.
# ESCA specific peaks set
atac_esca <- readr::read_tsv("ATAC-seq_data/TCGA-ATAC_Cancer_Type-specific_PeakCalls/ESCA_peakCalls.txt")
head(atac_esca)## # A tibble: 6 x 8
## seqnames start end name score annotation percentGC percentAT
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 chr1 1290095 1290596 ESCA_107 2.46 3' UTR 0.677 0.323
## 2 chr1 1291115 1291616 ESCA_108 2.59 3' UTR 0.703 0.297
## 3 chr1 1291753 1292254 ESCA_109 7.58 3' UTR 0.639 0.361
## 4 chr1 1440824 1441325 ESCA_160 4.47 3' UTR 0.659 0.341
## 5 chr1 1630188 1630689 ESCA_179 20.6 3' UTR 0.729 0.271
## 6 chr1 2030218 2030719 ESCA_341 15.6 3' UTR 0.619 0.381
## Number of peaks: 127055
# pan-cancer peak set
atac_pan <- readr::read_tsv("ATAC-seq_data/TCGA-ATAC_PanCancer_PeakSet.txt")
head(atac_pan)## # A tibble: 6 x 7
## seqnames start end name score annotation percentGC
## <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 chr1 906012 906513 ACC_10 7.17 Intron 0.613
## 2 chr2 112541661 112542162 ACC_10008 22.0 Promoter 0.555
## 3 chr1 21673421 21673922 ACC_1001 6.46 Distal 0.509
## 4 chr2 112584205 112584706 ACC_10013 43.2 Promoter 0.587
## 5 chr2 112596243 112596744 ACC_10016 5.43 Intron 0.491
## 6 chr1 21725692 21726193 ACC_1002 5.20 Intron 0.405
# from the pancan set how many belongs to each cancer type?
table(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])##
## ACC BLCA BRCA CESC CHOL COAD ESCA GBM HNSC KIRC KIRP LGG
## 29311 25337 49748 14358 11819 25404 13237 15394 16651 15067 24324 23836
## LIHC LUAD LUSC MESO PCPG PRAD SKCM STAD TGCT THCA UCEC
## 35787 23729 22195 22958 31372 30067 36591 17358 26120 31568 20478
##
## FALSE TRUE
## 113818 13237
##
## TRUE
## 13237
However, it is important to highlight that the “pan-cancer peak set” will keep the most significant peaks (hihghest score) for the overlapping peaks. Which means that the name in the “pan-cancer peak set” consists of the one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another subtype
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))## [1] 126935
So let’s check an overlaping peak. The “ESCA_17603” peak is not within the pancan set of peaks, because it overlaps the “ACC_10008” peak, which has a higher score.
## [1] FALSE
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 112541661-112542162 * | ACC_10008 22.03057903
## annotation percentGC
## <character> <numeric>
## [1] Promoter 0.55489022
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 112541649-112542150 * | ESCA_17603 6.25798951741993
## annotation percentGC percentAT
## <character> <numeric> <numeric>
## [1] Promoter 0.55688622754491 0.44311377245509
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Also it is important to note that the peaks size is the same.
## [1] 502
## [1] 502
In summary, in the pan-can set the ESCA peaks will be the ones that are the strongest when compared to the other cancer types. These are a subset of all ESCA peaks. So, if you are looking for all ATAC-seq ESCA peaks identified in at least two samples the cancer-specific set should be used.
5 Using ATAC-seq counts to compare two groups
Through the next section we will load the normalized counts data and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.
The main two files used in this section are:
- Normalized ATAC-seq insertion counts within the pan-cancer peak set. Recommended format. [RDS]
- All cancer type-specific count matrices in normalized counts. [ZIP]
In the code below we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID instead of TCGA barcodes and each patient has two samples.
## # A tibble: 4 x 8
## seqnames start end name score ESCA_1E6AC686_9… ESCA_1E6AC686_9…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA… 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA… 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA… 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA… 2.59 1.44 1.16
## # … with 1 more variable:
## # ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016 <dbl>
We will change the samples names to TCGA barcodes using the file “Lookup table for various TCGA sample identifiers. [TXT]” from GDC.
file.samples.ids <- "TCGA_identifier_mapping.txt"
if(!file.exists(file.samples.ids)) downloader::download("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c",file.samples.ids)
samples.ids <- readr::read_tsv(file.samples.ids)## Parsed with column specification:
## cols(
## bam_prefix = col_character(),
## stanfordUUID = col_character(),
## aliquot_id = col_character(),
## Case_UUID = col_character(),
## Case_ID = col_character()
## )
## # A tibble: 6 x 6
## bam_prefix stanfordUUID aliquot_id Case_UUID Case_ID sample
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BRCA-000CFD9F-A… 000CFD9F-ADDF-… TCGA-A7-A13… 2cf68894-1… TCGA-A7… TCGA-…
## 2 BRCA-000CFD9F-A… 000CFD9F-ADDF-… TCGA-A7-A13… 2cf68894-1… TCGA-A7… TCGA-…
## 3 PCPG-007124EC-1… 007124EC-1F9B-… TCGA-RM-A68… 1a1cf490-8… TCGA-RM… TCGA-…
## 4 PCPG-007124EC-1… 007124EC-1F9B-… TCGA-RM-A68… 1a1cf490-8… TCGA-RM… TCGA-…
## 5 STAD-00DFAA4D-D… 00DFAA4D-DE64-… TCGA-BR-A4J… e9a98a44-8… TCGA-BR… TCGA-…
## 6 STAD-00DFAA4D-D… 00DFAA4D-DE64-… TCGA-BR-A4J… e9a98a44-8… TCGA-BR… TCGA-…
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]## # A tibble: 4 x 8
## seqnames start end name score `TCGA-IG-A51D-0… `TCGA-IG-A51D-0…
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 chr1 10180 10679 ESCA… 2.08 1.44 1.22
## 2 chr1 180641 181140 ESCA… 2.17 2.55 2.23
## 3 chr1 181193 181692 ESCA… 6.63 2.10 1.98
## 4 chr1 184246 184745 ESCA… 2.59 1.44 1.16
## # … with 1 more variable: `TCGA-LN-A9FQ-01A-11-A617-42` <dbl>
atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
# create SE object
counts <- atac[,-c(1:5)]
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))
esca.rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
rowRanges = rowRanges,
colData = colData)
# Since we have two samples for each patient we will rename tham as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")## class: RangedSummarizedExperiment
## dim: 126935 33
## metadata(0):
## assays(1): log2norm
## rownames(126935): ESCA_1_chr1_10180_10679
## ESCA_2_chr1_180641_181140 ...
## ESCA_126943_chrX_155985136_155985635
## ESCA_126944_chrX_156030008_156030507
## rowData names(2): score name
## colnames(33): ESCC-TCGA-IG-A51D-01A_rep1
## ESCC-TCGA-IG-A51D-01A_rep2 ... ESAD-TCGA-M9-A5M8-01A_rep1
## ESAD-TCGA-M9-A5M8-01A_rep2
## colData names(140): sample patient ... Case_UUID Case_ID
## [1] "ESCC-TCGA-IG-A51D-01A_rep1" "ESCC-TCGA-IG-A51D-01A_rep2"
## [3] "ESCC-TCGA-LN-A9FQ-01A_rep1" "ESCC-TCGA-LN-A9FQ-01A_rep2"
## [5] "ESAD-TCGA-L7-A6VZ-01A_rep1" "ESAD-TCGA-L7-A6VZ-01A_rep2"
## [7] "ESAD-TCGA-L5-A4OE-01A_rep1" "ESAD-TCGA-L5-A4OE-01A_rep2"
## [9] "ESCC-TCGA-LN-A49O-01A_rep1" "ESCC-TCGA-LN-A49O-01A_rep2"
## [11] "ESAD-TCGA-IG-A7DP-01A_rep1" "ESAD-TCGA-IG-A7DP-01A_rep2"
## [13] "ESCC-TCGA-LN-A4A5-01A_rep1" "ESCC-TCGA-LN-A4A5-01A_rep2"
## [15] "ESCC-TCGA-IG-A4P3-01A_rep1" "ESCC-TCGA-IG-A3YA-01A_rep1"
## [17] "ESCC-TCGA-IG-A3YA-01A_rep2" "ESAD-TCGA-IG-A4QS-01A_rep1"
## [19] "ESAD-TCGA-IG-A4QS-01A_rep2" "ESCC-TCGA-LN-A7HX-01A_rep1"
## [21] "ESCC-TCGA-LN-A7HX-01A_rep2" "ESCC-TCGA-LN-A8HZ-01A_rep1"
## [23] "ESCC-TCGA-LN-A4A2-01A_rep1" "ESCC-TCGA-LN-A4A2-01A_rep2"
## [25] "ESCC-TCGA-LN-A4MR-01A_rep1" "ESCC-TCGA-IG-A625-01A_rep1"
## [27] "ESCC-TCGA-IG-A625-01A_rep2" "ESCC-TCGA-LN-A49W-01A_rep1"
## [29] "ESCC-TCGA-LN-A49W-01A_rep2" "ESAD-TCGA-IC-A6RE-01A_rep1"
## [31] "ESAD-TCGA-IC-A6RE-01A_rep2" "ESAD-TCGA-M9-A5M8-01A_rep1"
## [33] "ESAD-TCGA-M9-A5M8-01A_rep2"
5.1 Comparing ESCC vs ESAD ATAC-seq
We will use a t-test to identify the peaks that have a siginificant different mean counts between the ESCC and ESAD samples.
escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
return(tibble::tibble("raw_p_value"= results$p.value,
"ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak")
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")5.2 Volcano plot of t-test analysis
We can plot the results to visualize the results and select a good cut-off.
fdr.cut.off <- 0.01
diff.cut.off <- 2
TCGAbiolinks:::TCGAVisualize_volcano(x = result$ESCC_minus_ESAD,
y = result$FDR,
title = paste0("Volcano plot - ATAC-seq peaks ",
"difference in ",
"ESCC vs ESAD\n"),
filename = NULL,
label = c("Not Significant",
paste0("High in ESCC (vs ESAD)"),
paste0("Low in ESCC (vs ESAD)")),
ylab = expression(paste(-Log[10],
" (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
)),
xlab = expression(paste(
"Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
)),
x.cut = diff.cut.off,
y.cut = fdr.cut.off)# How many peaks pass our cut-offs
table(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off)##
## FALSE TRUE
## 122868 4067
5.3 Heatmap of differential significant peaks
First we will load the libraires used to plot the heatmap
# colors of the atac-seq data
pal_atac <- colorRampPalette(c('#3361A5',
'#248AF3',
'#14B3FF',
'#88CEEF',
'#C1D5DC',
'#EAD397',
'#FDB31A',
'#E42A2A',
'#A31D1D'))(100)
# Samples annotation
ha = HeatmapAnnotation(df = data.frame("Group" = esca.rse$primary_diagnosis,
"Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")),
show_annotation_name = T,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))
ht_list <-
Heatmap(plot.atac,
name = "ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")5.4 Heatmap of differential significant peaks (z-score)
A better way to visualize a heatmap is using the z-score transformation on the rows. Z-scores are centered and normalized, so the user can interpret a color as x standard deviations from the mean and have an intuitive idea of the relative variation of that value. This will make the visibility of the heatmap better since it will reduce the range of the values plots. For more information, please read the discussion here
In R the function scale can be used, since it works by column we have to transpose the matrix so it is applied to the peaks instead of the samples and then transpose it back.
plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ",
fdr.cut.off,", Diff mean log2 Count > ",
diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")5.5 Merging ATAC-seq replicates
If you want to instead of plot all replicates, to plot a single value for each patient you can get the mean of the values.
groupMeans <- function(mat, groups = NULL, na.rm = TRUE){
stopifnot(!is.null(groups))
gm <- lapply(unique(groups), function(x){
rowMeans(mat[,which(groups == x),drop = F], na.rm=na.rm)
}) %>% Reduce("cbind",.)
colnames(gm) <- unique(groups)
return(gm)
}
matMerged <- groupMeans(mat = assays(esca.rse)$log2norm, groups = colData(esca.rse)$sample)
# keep only metadata fro replicate 1
metadata <- colData(esca.rse)[grep("rep1",rownames(colData(esca.rse))),]
ha = HeatmapAnnotation(df = data.frame("Group" = metadata$primary_diagnosis),
show_annotation_name = T,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
plot.atac <- matMerged[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
ht_list <-
Heatmap(plot.atac,
name ="ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,",
Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")plot.atac.row.z.score <- t(scale(t(plot.atac))) # row z-score
col.zscore <- colorRamp2(seq(-2, 2, by = 4/99), pal_atac)
ht_list <-
Heatmap(plot.atac.row.z.score,
name = "Row z-score (ATAC-seq log2(counts))",
col = col.zscore,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
right_annotation = rows.annot,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
#column_order = cols.order,
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
#width = unit(15, "cm"),
#column_title = paste0("RNA-seq z-score (n = ", ncol(plot.exp),")"),
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")6 Other useful codes
6.1 Get all TCGA cancer-specific peaks peaks
The code below will download all cancer specific peaks and transform them into an R object.
library(readr)
library(GenomicRanges)
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
library(plyr)
#-------------------------------
# Cancer specific peaks as SummaerizedExperiment object
#-------------------------------
zip.file <- "export.zip"
if(!file.exists(zip.file)) {
download("https://api.gdc.cancer.gov/data/38b8f311-f3a4-4746-9829-b8e3edb9c157",zip.file)
unzip(zip.file)
}
prepareCancerSpecificPeaks <- function(file){
output <- gsub(".txt",".rda",file)
if(file.exists(output)) return()
atac <- readr::read_tsv(file)
file.samples.ids <- "TCGA_identifier_mapping.txt"
if(!file.exists(file.samples.ids)) downloader::download("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c",file.samples.ids)
samples.ids <- readr::read_tsv(file.samples.ids)
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colnames(atac)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac)[-c(1:5)]),samples.ids$bam_prefix)]
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(1:5)]))
if(grepl("ESCA",file)){
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(1:5)] <- samples.map[match(substr(colnames(atac)[-c(1:5)],1,16),substr(samples.map,6,21))]
} else {
colnames(atac)[-c(1:5)] <- samples.info$sample[match(substr(colnames(atac)[-c(1:5)],1,16),samples.info$sample)]
}
# create SE object
counts <- atac[,-c(1:5)]
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
colData <- DataFrame(unique(left_join(samples.info,samples.ids)))
rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
rowRanges = rowRanges,
colData = colData)
save(rse,file = output, compress = "xz")
}
plyr::a_ply(dir(pattern=".txt"),1, function(f)
tryCatch({prepareCancerSpecificPeaks(f)}, error = function(e){message(e)}),
.progress = "time")6.2 ATAC-seq Bigwig
The ATAC-seq bigwig files available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG
Here are some information about the bigwig files.
All bigWig files for each cancer type are compressed using tar and gzip. As such, each of the .tgz files contains all of the individual bigWig files for each technical replicate.
We recommend extracting the files using the following command: tar -zxvf file_name.tgz –strip-components 8 where the “–strip-components 8” extracts the files without copying their original directory structure
The provided bigWig files have been normalized by the total insertions in peaks and then binned into 100-bp bins. Each 100-bp bin represents the normalized number of insertions that occurred within the corresponding 100 bp.
The bigwig names also use Stanford UUIDs. The script below will help reanaming the bigwifiles with TCGA barcodes. First we get the path to the downloaded bigwig files after uncompressing them and read the information file from the ATAC-seq website.
bigwig.files <- dir(path = "ATAC-seq_data/ESCA_bigWigs/",
pattern = ".bw",
all.files = T,
full.names = T)
table <- readr::read_tsv("https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c")
head(table)Then, for each file we will map then in the downloaded table and rename them.
# rename bigwig files add more info in the nanem
plyr::a_ply(bigwig.files,1, function(file) {
file.uuid <- stringr::str_extract(file,
"[:alnum:]{8}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{4}_[:alnum:]{12}")
idx <- grep(file.uuid,gsub("-","_",table$stanfordUUID))
barcode <- unique(table[idx,]$Case_ID)
if(grepl("ESCA",file)){
samples.info <- TCGAbiolinks:::colDataPrepare(barcode)
barcode <- gsub(",| |NOS","",
gsub("Adenocarcinoma","ESAD",
gsub("Squamous cell carcinoma","ESCC",
paste0(samples.info$primary_diagnosis,"-",
samples.info$sample)
)
)
)
}
# change UUID to barcode
to <- gsub(file.uuid,barcode,file)
file.rename(file, to)
})Since loading several bigwigs might be pretty slow in software like IGV users might want to reduce the bigwig files to a single chromossome (i.e. chr20). The Rscript below can do it by transforming the bigwig to a wig with only chr20 then converting the wig back to big wig.
You can downlaod at the executable forbigWigToWigand wigToBigWig at ENCODE (http://hgdownload.cse.ucsc.edu/admin/exe/) and the hg38.chrom.sizes is available at github (https://raw.githubusercontent.com/igvteam/igv/master/genomes/sizes/hg38.chrom.sizes)
chr <- 20
dirout <- paste0("chr",chr)
dir.create(dirout)
files <- dir(path = ".",pattern = "bw",full.names = T)
for(f in files){
f.in <- f
f.out <- gsub("bw","wig",f)
f.out.chr <- file.path(dirout,gsub("\\.bw",paste0("_chr",chr,".bw"),f))
cmd <- paste0("bigWigToWig -chrom=chr",chr," ", f.in," ", f.out)
system(cmd)
cmd <- paste0("wigToBigWig ", f.out," hg38.chrom.sizes ", f.out.chr)
system(cmd)
}6.2.1 Visualizing the bigwig in R
We upload some samples (chrmomssome 20 only) at the google drive that can be plotted in R using the karyoploteR pacakge.
## Loading required package: regioneR
pp <- getDefaultPlotParams(plot.type=1)
pp$leftmargin <- 0.15
pp$topmargin <- 15
pp$bottommargin <- 15
pp$ideogramheight <- 5
pp$data1inmargin <- 10
pp$data1outmargin <- 0
HNF4A.region <- toGRanges("chr20:44,350,700-44,450,596")
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex=0.5, plot.params = pp)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
genes.data <- makeGenesDataFromTxDb(TxDb.Hsapiens.UCSC.hg38.knownGene,
karyoplot = kp,
plot.transcripts = TRUE,
plot.transcripts.structure = TRUE)
genes.data <- addGeneNames(genes.data)## Loading required package: org.Hs.eg.db
##
genes.data <- mergeTranscripts(genes.data)
kp <- plotKaryotype(zoom = HNF4A.region,genome = "hg38", cex=0.5, plot.params = pp)
kpAddBaseNumbers(kp, tick.dist = 10000, minor.tick.dist = 2000,
add.units = TRUE, cex = 0.5, tick.len = 3)
kpPlotGenes(kp, data = genes.data, r0 = 0, r1 = 0.25, gene.name.cex = 0.5)
big.wig.files <- dir(path = "ATAC-seq_data/ESCA_bigWigs/",
pattern = ".bw",
all.files = T,
full.names = T)
big.wig.files## [1] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESAD-TCGA-L5-A4OE-01A_X015_S05_L033_B1_T1_P034.insertions_chr20.bw"
## [2] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESAD-TCGA-L7-A6VZ-01A_X018_S02_L027_B1_T1_P040.insertions_chr20.bw"
## [3] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESCC-TCGA-IG-A51D-01A_X012_S02_L027_B1_T1_P024.insertions_chr20.bw"
## [4] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESCC-TCGA-LN-A49O-01A_X008_S08_L015_B1_T1_P017.insertions_chr20.bw"
## [5] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESCC-TCGA-LN-A4A5-01A_X007_S06_L065_B1_T1_P012.insertions_chr20.bw"
## [6] "ATAC-seq_data/ESCA_bigWigs//ESCA_ESCC-TCGA-LN-A9FQ-01A_X010_S08_L063_B1_T1_P016.insertions_chr20.bw"
out.at <- autotrack(1:length(big.wig.files),
length(big.wig.files),
margin = 0.3,
r0 = 0.3,
r1 = 1)
kpAddLabels(kp,
labels = "ATAC-seq",
r0 = out.at$r0,
r1 = out.at$r1,
side = "left",
cex = 1,
srt = 90,
pos = 3,
label.margin = 0.1)
for(i in seq_len(length(big.wig.files))) {
bigwig.file <- big.wig.files[i]
at <- autotrack(i, length(big.wig.files), r0=out.at$r0, r1=out.at$r1, margin = 0.2)
kp <- kpPlotBigWig(kp,
data = bigwig.file,
ymax = "visible.region",
#ymax = 124,
r0 = at$r0,
col = ifelse(grepl("ESCC",bigwig.file),"#0000FF","#FF0000"),
r1 = at$r1)
computed.ymax <- ceiling(kp$latest.plot$computed.values$ymax)
kpAxis(kp,
ymin = 0,
ymax = computed.ymax,
numticks = 2,
r0 = at$r0,
r1 = at$r1,
cex = 0.5)
kpAddLabels(kp,
labels = ifelse(grepl("ESCC",bigwig.file),"ESCC","EAC"),
#labels = substr(basename(big.wig.files[i]),1,20),
r0 = at$r0,
r1 = at$r1,
cex = 1,
label.margin = 0.01)
}7 Session information
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.8.2
## [2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.6
## [3] GenomicFeatures_1.36.3
## [4] AnnotationDbi_1.46.0
## [5] karyoploteR_1.10.2
## [6] regioneR_1.16.2
## [7] circlize_0.4.6
## [8] ComplexHeatmap_2.0.0
## [9] TCGAbiolinks_2.13.2
## [10] plyr_1.8.4
## [11] SummarizedExperiment_1.14.0
## [12] DelayedArray_0.10.0
## [13] BiocParallel_1.18.0
## [14] matrixStats_0.54.0
## [15] Biobase_2.44.0
## [16] dplyr_0.8.2
## [17] tidyr_0.8.3
## [18] GenomicRanges_1.36.0
## [19] GenomeInfoDb_1.20.0
## [20] IRanges_2.18.1
## [21] S4Vectors_0.22.0
## [22] BiocGenerics_0.30.0
## [23] readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.4 Hmisc_4.2-0
## [3] aroma.light_3.14.0 selectr_0.4-1
## [5] ConsensusClusterPlus_1.48.0 lazyeval_0.2.2
## [7] splines_3.6.0 ggplot2_3.2.0
## [9] sva_3.32.1 digest_0.6.19
## [11] ensembldb_2.8.0 foreach_1.4.4
## [13] htmltools_0.3.6 fansi_0.4.0
## [15] checkmate_1.9.3 magrittr_1.5
## [17] memoise_1.1.0 BSgenome_1.52.0
## [19] cluster_2.1.0 doParallel_1.0.14
## [21] limma_3.40.2 Biostrings_2.52.0
## [23] annotate_1.62.0 R.utils_2.9.0
## [25] prettyunits_1.0.2 colorspace_1.4-1
## [27] blob_1.1.1 rvest_0.3.4
## [29] ggrepel_0.8.1 xfun_0.8
## [31] crayon_1.3.4 RCurl_1.95-4.12
## [33] jsonlite_1.6 genefilter_1.66.0
## [35] zeallot_0.1.0 VariantAnnotation_1.30.1
## [37] survival_2.44-1.1 zoo_1.8-6
## [39] iterators_1.0.10 glue_1.3.1
## [41] survminer_0.4.4 gtable_0.3.0
## [43] zlibbioc_1.30.0 XVector_0.24.0
## [45] GetoptLong_0.1.7 shape_1.4.4
## [47] scales_1.0.0 DESeq_1.36.0
## [49] bezier_1.1.2 DBI_1.0.0
## [51] edgeR_3.26.5 ggthemes_4.2.0
## [53] Rcpp_1.0.1 htmlTable_1.13.1
## [55] xtable_1.8-4 progress_1.2.2
## [57] cmprsk_2.2-8 clue_0.3-57
## [59] foreign_0.8-71 bit_1.1-14
## [61] matlab_1.0.2 km.ci_0.5-2
## [63] Formula_1.2-3 htmlwidgets_1.3
## [65] httr_1.4.0 RColorBrewer_1.1-2
## [67] acepack_1.4.1 pkgconfig_2.0.2
## [69] XML_3.98-1.20 R.methodsS3_1.7.1
## [71] nnet_7.3-12 locfit_1.5-9.1
## [73] utf8_1.1.4 tidyselect_0.2.5
## [75] labeling_0.3 rlang_0.4.0
## [77] munsell_0.5.0 tools_3.6.0
## [79] downloader_0.4 cli_1.1.0
## [81] generics_0.0.2 RSQLite_2.1.1
## [83] broom_0.5.2 evaluate_0.14
## [85] stringr_1.4.0 yaml_2.2.0
## [87] knitr_1.23 bit64_0.9-7
## [89] survMisc_0.5.5 purrr_0.3.2
## [91] AnnotationFilter_1.8.0 EDASeq_2.18.0
## [93] nlme_3.1-140 R.oo_1.22.0
## [95] xml2_1.2.0 biomaRt_2.41.4
## [97] rstudioapi_0.10 compiler_3.6.0
## [99] curl_3.3 png_0.1-7
## [101] ggsignif_0.5.0 tibble_2.1.3
## [103] geneplotter_1.62.0 stringi_1.4.3
## [105] lattice_0.20-38 ProtGenerics_1.16.0
## [107] Matrix_1.2-17 KMsurv_0.1-5
## [109] vctrs_0.1.0 pillar_1.4.2
## [111] GlobalOptions_0.1.0 data.table_1.12.2
## [113] bitops_1.0-6 rtracklayer_1.44.0
## [115] R6_2.4.0 latticeExtra_0.6-28
## [117] hwriter_1.3.2 ShortRead_1.42.0
## [119] gridExtra_2.3 codetools_0.2-16
## [121] dichromat_2.0-0 assertthat_0.2.1
## [123] rjson_0.2.20 GenomicAlignments_1.20.1
## [125] Rsamtools_2.0.0 GenomeInfoDbData_1.2.1
## [127] mgcv_1.8-28 hms_0.4.2
## [129] rpart_4.1-15 prettydoc_0.2.1
## [131] bamsignals_1.16.0 rmarkdown_1.13
## [133] biovizBase_1.32.0 ggpubr_0.2.1
## [135] base64enc_0.1-3